Abstract. First, we describe two types of assessment (problem
solving and standard treatment protocol) within a "responsive- 
ness-to-instruction" framework to identify learning disabilities. 
We then specify two necessary components (measures and classi- 
fication criteria) to assess responsiveness-to-instruction, and pres- 
ent pertinent findings from two related studies. These studies 
involve databases at grades 1 and 2, which were analyzed to com- 
pare the soundness of alternative methods of assessing instruc- 
tional responsiveness to identify reading disabilities. Finally, 
conclusions are drawn and future research is outlined to pros- 
pectively and longitudinally explore classification issues that 
emerged from our analyses. 
Over the 25-year history of the Individuals with
Disabilities Education Act (IDEA), the number of stu- 
dents identified as having learning disabilities (LD) has 
increased dramatically. Prior to 1970, students with LD 
were rarely identified. Now, they comprise more than 
50% of all children with disabilities, or 5% of the 
school population (U.S. Department of Education, 
2000). The dramatic increase in the prevalence of LD 
has raised concerns about the methods by which these 
children are identified. 

This concern, we believe, is well founded. Because LD 
is defined as unexpected failure to learn, the discrep- 
ancy between intelligence and achievement has been 
the keystone in the process by which LD is typically 
identified. Yet, the measurement of discrepancy is 
problematic because of the poor reliability of difference 
scores (Reynolds, 1984), and because practitioners' use
of varying discrepancy formulae and test instruments
tends to identify different students (e.g., Shepard,
Smith, & Vojir, 1983). Moreover, research documents 
similar underlying deficits in children with reading dif- 
ficulties whether or not they demonstrate discrepancies 
between intelligence and achievement (Fletcher et al., 
1998; Fletcher et al., 1994; Francis, Fletcher, Shaywitz, 
Shaywitz, & Rourke, 1996; Vellutino et al., 1996).

These and other problems have prompted calls for 
alternative identification methods (e.g., Lyon et al., 
2001; Siegel, 1989). One alternative approach is respon- 
siveness-to-instruction, or RTI. With RTI, students are 
identified as LD when their response to generally effec- 
tive instruction (i.e., instruction to which most chil- 
dren respond) is dramatically inferior to that of their 
peers. The basic assumption is that RTI can differenti- 
ate between two explanations of low achievement: 
poor instruction versus disability. If a child is
nonresponsive to instruction that benefits a majority of
students, the assessment process eliminates poor
instruction as an explanation for the child's inadequate
growth. Instead, it suggests that disability is responsible 
and that specialized intervention is necessary to boost 
academic achievement and chances for post-school 
success. 

RTI has generated considerable attention. The U.S. 
Department of Education's Office of Special Education 
Programs recently sponsored a series of white papers 
and an LD Summit (see Bradley, Danielson, & Hallahan, 
2002), partly to explore the viability of RTI. The Presi- 
dent's Commission on Excellence in Special Education 
(2002) and a National Academy of Sciences committee 
on overrepresentation of minority students in special 
education (Donovan & Cross, 2002) also encouraged 
consideration of its use. Moreover, an entire issue of 
Learning Disabilities Research and Practice (Vaughn & 
Fuchs, 2003) was recently devoted to the topic. 

Despite this mostly positive attention, many ques- 
tions about RTI remain unanswered. For example, the 
social consequences of such a reorientation to LD iden- 
tification, including prevalence rates, equity issues, and 
prevention outcomes, are yet to be studied. There are 
questions, too, about what measures of and criteria for 
instructional responsiveness should be used to yield 
reliable and valid decision-making. 

In this article, we focus on assessment for identifica- 
tion of reading disability. By some estimates (Lyon, 
1995), 80% of students with LD suffer their most serious 
academic difficulties in reading. Although, in the earli- 
est grades, this mostly involves word analysis and word 
identification, eventual problems include reading flu- 
ency and comprehension (Gough, 1996; Perfetti, 
Marron, & Foltz, 1996; Shankweiler et al., 1999), which 
grow more serious as the school curriculum focuses 
increasingly on reading for meaning and for learning 
new information in the later grades. 

We begin by explaining conceptual and technical 
strengths and weaknesses of two forms of RTI for 
reading disability identification. We then specify the 
components necessary to assess instructional respon- 
siveness, and present data from two recent and perti- 
nent studies, in which we explore the technical 
soundness of alternative operationalizations of instruc- 
tional responsiveness. We conclude by outlining 
prospective and longitudinal research to examine iden- 
tification and classification issues. 

READING DISABILITY AS RTI: TWO 
CONCEPTUAL APPROACHES 

Problem-Solving in General Education 

An RTI approach to identifying disability is rooted in 
a 1982 National Research Council study (Heller,
Holtzman, & Messick), which proposed that the validity
of any special education classification must be
judged according to three criteria: (a) that mainstream 
education was generally effective; (b) that special edu- 
cation improved student outcomes, thus justifying the 
classification; and (c) that the assessment process used 
for identification was valid. Only when all three
criteria were met, claimed Heller et al., was a special
education classification justifiable.

Fuchs (1995) borrowed the Heller et al. (1982) frame- 
work (see also Fuchs & Fuchs, 1998) to specify a three- 
phase process to assess disability. In Phase I, the rate of 
growth of all students in a mainstream classroom is 
tracked. The purpose of such classwide assessment is to 
determine whether the instructional environment is 
sufficiently nurturing to expect student progress. If, 
across all students, the mean rate of growth is low in 
comparison to other classes of children in the same 
building, the same district, or the entire nation, the 
appropriate decision would be to intervene at the class- 
room level to develop a stronger instructional program 
for all. 

After establishing that classroom instruction is gener- 
ally effective, Phase II assessment commences with the 
identification of students whose level of performance 
and rate of improvement are well below those of class- 
room peers. The purpose of this assessment, therefore, 
is to identify a subset of children whose potential aca- 
demic failure is signaled by their unresponsiveness to 
generally effective instruction. For only these children, 
the next phase, Phase III assessment, includes problem- 
solving and systematic tryouts of individualized adap- 
tations in the mainstream setting. The purpose of 
problem solving and adaptations is to determine 
whether the general education classroom can be trans- 
formed into a productive learning environment for 
these at-risk students. Only when such adaptations fail 
to improve student growth do practitioners consider 
special services. The assumption is that if the individu- 
alized adaptations do not produce growth for the at- 
risk students, some inherent deficit or disability is 
probably making it difficult for them to benefit. 

To conduct Phase I, II, and III assessments, Fuchs
(1995) suggested curriculum-based measurement 
(CBM; Deno, 1985), an approach that permits model- 
ing of student responsiveness to instruction. In Phase I, 
CBM quantifies "classroom instructional quality" as 
mean performance level and growth rate for the entire 
class. In Phase II, "risk" is defined as a dual discrepancy 
(on CBM performance level and CBM growth rate) 
between the targeted at-risk student and classmates. In 
Phase III, CBM is used to index "responsiveness to class- 
room adaptations," with the goal of boosting the at-risk 
student's CBM level and rate within the range of the 
class mean. Fuchs provided data to show how CBM
meets important standards with respect to Heller et al.'s 
(1982) third criterion: that the assessment process used 
for classification, requiring judgments about the quality 
of the instructional setting and the student's respon- 
siveness in that setting, is accurate and meaningful. 
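
As a rough illustration of the quantities involved, the sketch below estimates each student's CBM level and growth rate from weekly scores and averages them for the class. The least-squares slope, the use of the mean weekly score as "level," and all names and data are assumptions made for illustration; Fuchs (1995) is not being quoted as prescribing this computation.

```python
from statistics import mean

def cbm_slope(scores):
    """Least-squares slope of weekly scores (score units per week)."""
    weeks = list(range(len(scores)))
    w_bar, s_bar = mean(weeks), mean(scores)
    num = sum((w - w_bar) * (s - s_bar) for w, s in zip(weeks, scores))
    den = sum((w - w_bar) ** 2 for w in weeks)
    return num / den

def class_profile(class_scores):
    """Return (mean level, mean slope) across all students in one class."""
    levels = [mean(s) for s in class_scores.values()]
    slopes = [cbm_slope(s) for s in class_scores.values()]
    return mean(levels), mean(slopes)

# Hypothetical weekly words-read-correctly scores for three students.
scores = {"A": [10, 12, 14, 16], "B": [20, 21, 22, 23], "C": [5, 5, 6, 6]}
level, slope = class_profile(scores)
```

In a Phase I check, the class-level pair returned here would be compared with the corresponding values for other classes in the building, district, or nation.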

Standard Treatment Protocol 

To address at-risk students' learning problems in
general education, Fuchs (1995) proposed a series of
adaptations teachers might incorporate in a routine way.
More recently, others have reformulated Phase III in
Fuchs's model to more strongly emphasize remediation 
of at-risk students' difficulties. Sometimes this is 
attempted through an iterative problem-solving 
process (e.g., Grimes, 2000; Marston et al., 2003). More 
commonly, an intensive fixed-duration trial (e.g., 10- 
15 weeks) of small-group or individual tutoring is used, 
involving a validated standard treatment protocol (e.g., 
Al Otaiba & Fuchs, 2004; McMaster, Fuchs, Fuchs, & 
Compton, in press; Vellutino et al., 1996). If the stu- 
dent responds to an intensive treatment trial, she is 
seen as remediated and disability-free and is returned to 
the general education classroom for instruction. If, on 
the other hand, she is non-responsive, a disability is 
suspected and further evaluation is warranted. 

A recent study by Vaughn, Linan-Thompson, and 
Hickman-Davis (2002) illustrates this more recent stan- 
dard treatment protocol approach to RTI. Second-grade 
students at-risk for reading disability were assessed and 
provided 10 weeks of supplemental, small-group read- 
ing instruction. Afterwards, all who met a priori cut- 
points were no longer included in the supplemental 
instruction; remaining students were regrouped and 
provided another 10 weeks of instruction. This
continued for 30 weeks, at which point the subset of
students who still had not met criteria for dismissal from
supplemental instruction (25% of the original sample) was
considered for special education.

This relatively intensive three-phase approach trans- 
forms an identification process into prevention. 
Variations by others on this preventive approach 
include different numbers of tiers, or phases, and differ- 
ent types of activities occurring at the various tiers (see 
Fuchs, Mock, Morgan, & Young, 2003, for discussion). 

Conceptual and Technical Distinctions 

Problem solving in general education and use of stan- 
dard treatment protocols represent two approaches to 
RTI. They differ both conceptually and with respect to 
technical issues. Each, for example, has its own implicit 
meaning of "responsiveness/non-responsiveness." Use 
of a standard treatment protocol provides a very
rigorous test for non-responders and the presence of
disability. Students, like those in Vaughn et al.'s (2002) study,
participate in a research-backed, intensive, and iterative
instructional process. In such circumstances, it makes 
little sense to point to poor or inadequate instruction as 
a cause of non-responsiveness. It makes more sense to 
consider disability as a cause. At the same time, use of 
a standard treatment protocol raises the question: Is it 
possible that some children who are responsive to 
instruction in a second or third tier of a multi-tier 
approach still have disabilities and, once returned to 
general education instruction without the intensity 
and systematicity of the standard treatment protocol, 
again demonstrate the same learning problems that 
first marked them as candidates for participation in the 
standard treatment protocol? In short, whereas the 
standard treatment protocol approach is likely to
identify "true" non-responders, is it also likely to produce
"false" negatives? For example, in the Vaughn et al.
study, a subset of children who met criteria for dis- 
missal from intensive tutoring subsequently failed to 
thrive in general education and eventually required 
additional attention. 

By contrast, an at-risk student's responsiveness to 
general education with individualized adaptations sug- 
gests that adequate learning will continue without fur- 
ther intervention. Students in a generally effective 
instructional classroom with adaptations, whose learn- 
ing is much worse than that of classroom peers, are 
likely to require the intensity of instruction special edu- 
cation is meant to provide. Moreover, defining "inter- 
vention" and "responsiveness/non-responsiveness" in 
general education presumes that disability should be 
assessed as it occurs under "normal" conditions: in the
mainstream setting. This parallels contexts in which 
other psychological conditions are diagnosed. Ruling 
out disability only after intensive effort improves a 
condition seems akin to concluding that a patient 
never had cancer because surgery restored her to 
health. 

Regarding technical issues, problem solving and stan- 
dard treatment protocol approaches create different 
challenges. Relying on general education to assess 
responsiveness to instruction has the advantage of a 
normative framework referenced to the typical popula- 
tion. That is, responsiveness to generally effective 
instruction can be estimated for all students so that a 
normative profile can be generated to describe the full 
range of response. With general education instruction 
as the intervention, traditional cut-points (e.g., 1.5 
standard deviations below the mean) may be used to 
define disability. Such an approach requires
measurement of all students. By contrast, it seems unlikely that
a normative framework can be applied to the standard
treatment protocol approach, because logistics and logic
seem to argue against exposing the full range of
students to an intensive tutoring regimen for the
purpose of producing a normative profile. In all likelihood,
practitioners would need to rely on a normative frame- 
work restricted to very poor readers, a proposition 
requiring empirical validation. 

In comparison to the standard treatment protocol 
approach, problem solving is usually associated with a 
lower bar to determine non-responsiveness and easier 
access to special education. Assuming that special edu- 
cation is effective, this helps ensure that all children 
with special needs receive appropriate services. Yet, rel- 
atively easy access to special education can, in some 
cases, reflect a "rush to judgment" and identification of
"false positives," or children who are incorrectly iden- 
tified and labeled. The standard treatment protocol 
approach, by contrast, tends to provide more intensive 
instruction, to which many children respond posi- 
tively. However, it is also more likely to produce "false 
negatives," or students with disabilities who improve 
during intensive tutoring only to be returned to general 
education where they fail once again. In selecting 
between these two approaches, it may be necessary to 
determine whether one's primary intent is identifica- 
tion or prevention. 

READING DISABILITY AS RTI:
TWO ASSESSMENT COMPONENTS

Regardless of which RTI approach is adopted, two 
components of the assessment process must be speci- 
fied. First, methods must be determined for measuring 
students' response to instruction. That is, measures 
must be specified for tracking responsiveness, and so 
must the frequency with which the measures are 
administered. Second, once student responsiveness has 
been quantified, a criterion must be applied for defin- 
ing non-responsiveness. Below such a criterion, stu- 
dents are identified as having reading disabilities. 

Prior Research on Measuring and Defining 
Non-Responsiveness 

Various methods are available for specifying these 
two assessment components. Vellutino et al. (1996) 
tested students on subtests of the Woodcock Reading 
Mastery Tests several times over the course of a multi- 
year study. To establish a cut-point for responsiveness, 
they rank-ordered slopes representing children's 
growth in response to tutoring, performed a
"median split" on the slopes, and designated the
bottom half as non-responsive. Similarly, Torgesen
and colleagues (2001) evaluated student performance
at the end of treatment on the subtests of the
Woodcock Reading Mastery Tests, designating
non-responsiveness as failing to achieve "normalized"
status; that is, a word-reading standard score of 90 or
better. Finally, Good, Simmons, and Kame'enui (2001),
like Torgesen et al., also specified non-responsiveness 
in terms of posttreatment status. However, their 
approach involves a criterion-referenced "benchmark" 
associated with future reading success. 

Speece and Case (2001) took yet a different tack. 
They adopted frequent measurement using CBM so 
that non-responsiveness could be identified earlier in 
the school year than was possible with the Vellutino et 
al., Torgesen et al., or Good et al. methods. Speece and 
Case applied a "dual discrepancy" criterion.
Non-responders were students whose slope and level of
performance both fell at least 1 standard deviation below
their class mean. This dual discrepancy could also
be determined with respect to school, district, or
national norms or using benchmark cut-points
associated with future school success.
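
A minimal sketch of a dual-discrepancy rule of the kind Speece and Case describe follows. It assumes each student already has a level and a slope, uses the class mean and sample standard deviation as the reference, and flags a student only when both values fall at least 1 standard deviation below the class mean. The function name, the data, and the choice of the sample (rather than population) standard deviation are illustrative assumptions.

```python
from statistics import mean, stdev

def dual_discrepant(levels, slopes, k=1.0):
    """Return students at least k SDs below the class mean on BOTH level and slope."""
    lvl_vals, slp_vals = list(levels.values()), list(slopes.values())
    lvl_cut = mean(lvl_vals) - k * stdev(lvl_vals)
    slp_cut = mean(slp_vals) - k * stdev(slp_vals)
    return sorted(name for name in levels
                  if levels[name] <= lvl_cut and slopes[name] <= slp_cut)

# Hypothetical class: words read correctly (level) and weekly gain (slope).
levels = {"A": 40, "B": 42, "C": 38, "D": 41, "E": 15}
slopes = {"A": 1.5, "B": 1.6, "C": 0.3, "D": 1.4, "E": 0.2}
flagged = dual_discrepant(levels, slopes)
```

In this made-up class, student C grows slowly but performs at an adequate level and so is not flagged; only student E is discrepant on both dimensions. Swapping the class statistics for school, district, or national norms would change only the two cut-points.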

Many other options exist for measuring and defining 
students' non-responsiveness to instruction. Unfortu- 
nately, few studies have explored these alternatives. 

Our Research on Measuring and Defining 
Non-Responsiveness 

To provide information about how to identify 
responders and non-responders, we retrospectively 
analyzed the data from two reading intervention
studies, both designed in parallel fashion and involving a
standard treatment protocol. One study was conducted in
first grade; the other in second grade. In each grade, we
identified students for intensive tutoring in an
unusual manner. Instead of assessing and identifying
them in the beginning of the school year, 20 first- 
and second-grade teachers implemented Peer-Assisted 
Learning Strategies (PALS; Fuchs & Fuchs, in press; 
Fuchs, Fuchs, Mathes, & Simmons, 1997; Fuchs et al., 
2001), a validated classroom-based reading program. In 
each of the 20 classes, we designated a subset of chil- 
dren as at risk based on beginning-of-school-year 
screenings: approximately 40% of the full sample of 
first-graders (the lowest eight students per class on 
letter naming fluency) and 30% of second-graders 
(the lowest six students per class on CBM). We moni- 
tored these at-risk students' responsiveness to the PALS 
program. We also monitored the responsiveness of 
typically achieving children. At the end of the first 
semester, we identified non-responsive students whose 
performance was substantially below that of classroom 
peers. 

To monitor progress at grade 1, we collected weekly 
data in two areas: word identification and word attack. 
To index word identification, we measured students on 
alternate forms of the Dolch word list, where students 
had 1 minute to read high-frequency words. To track 
the development of word attack skills, we measured 
students on alternate forms of the nonsense word
fluency measure (see the Dynamic Indicators of Basic
Early Literacy Skills; DIBELS; Good et al., 2001). With 
nonsense word fluency, students are given lists of con- 
sonant-vowel-consonant pseudo-words and are 
instructed to say sounds or decode the pseudo-word. 
The score is the number of sounds read correctly (with 
three sounds awarded for a correctly decoded pseudo- 
word) in 1 minute. Second-graders' reading develop- 
ment was monitored with CBM oral reading fluency 
(Deno, 1985). 
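
The scoring rule for nonsense word fluency can be sketched as follows. The way responses are recorded here (the string "word" for a pseudo-word decoded correctly as a whole, or an integer from 0 to 3 for the number of individual sounds said correctly) is a hypothetical convention adopted only for this example, not the DIBELS record form.

```python
def nwf_score(responses):
    """Total correct sounds for the items attempted within the 1-minute limit.

    Each response is either "word" (pseudo-word decoded correctly as a
    whole, credited with 3 sounds) or an int 0-3 giving the number of
    individual sounds said correctly.
    """
    total = 0
    for r in responses:
        if r == "word":
            total += 3
        elif isinstance(r, int) and 0 <= r <= 3:
            total += r
        else:
            raise ValueError(f"unexpected response: {r!r}")
    return total

# Hypothetical record: two items decoded whole, one with 2 sounds, one with 1.
score = nwf_score(["word", 2, "word", 1])
```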

Using these progress-monitoring data, we calculated 
dual discrepancies relative to classroom peers and the 
entire experimental sample. Students who were non- 
responsive to PALS were at least .5 standard deviations 
below the reference groups on both measures in grade 
1 and on CBM in grade 2. Using this method, we iden- 
tified 54 first-graders and 64 second-graders requiring 
additional attention. This represented about 13% and 
10% of the experimental groups in first and second 
grade, respectively. These children were then assigned 
randomly to intensive tutoring or to continue in PALS. 
The subset of students assigned to intensive tutoring, 
36 of the 54 first-graders and 48 of the 64 second- 
graders, are the children on whom we conducted the 
analyses described below. 

At both grade levels, the tutoring activities addressed 
phonological awareness, letter-sound recognition, 
decoding, sight-word recognition, fluency building, 
and sentence and story reading. Tutoring was con- 
ducted for 10-12 weeks, 30-35 minutes per session. At 
grade 1, the one-to-one sessions were conducted three 
times a week. At grade 2, students were assigned ran- 
domly to small-group instruction or individual tutor- 
ing, which, in either case, was conducted four times 
a week. Throughout the tutoring, the weekly progress 
monitoring continued. 

Below, we describe additional study procedures and 
summarize the findings separately for the grade 1 and 
grade 2 databases. These analyses were conducted retro- 
spectively. Therefore, our methods for judging instruc- 
tional responsiveness, and our strategies for assessing 
the validity of the methods, were limited to variables in 
the database. The reader should be mindful that these 
analyses address responsiveness to an intensive stan- 
dard treatment protocol conducted during the second 
semester, not to the implementation of PALS in the 
general education classroom during the first semester. 

First-Grade Study Procedures and Findings 

Study procedures. At grade 1, responsiveness to a 
standard treatment protocol was judged using four 
methods. The first two were modeled after Vellutino et 
al. (1996), using median splits on slopes calculated over
the course of the tutoring: one on the Dolch weekly
monitoring data; the other on the nonsense word flu- 
ency weekly monitoring data. The remaining two 
methods were based on students' posttreatment status. 
Using Torgesen et al.'s (2001) framework, one criterion 
for determining responsiveness was achieving "nor- 
malized" posttreatment status; that is, a standard score 
of 90 or greater on the word reading score of the 
Woodcock Reading Mastery Tests. The other
posttreatment status criterion was based on the DIBELS's
year-end first-grade benchmark of 40 words read correctly
from text in 1 minute (Good et al., 2001). We refer to 
these four methods of assessing responsiveness, respec- 
tively, as (a) Dolch slope median split, (b) nonsense 
word fluency slope median split, (c) normalized post- 
treatment status, and (d) benchmark posttreatment 
status. 
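
The two families of criteria can be sketched in a few lines. The data below are invented, and the handling of a student who falls exactly at the median (treated here as responsive) is an assumption, since the text says only that the bottom half is designated non-responsive.

```python
from statistics import median

def median_split_nonresponders(slopes):
    """Students whose slope falls below the sample median are non-responsive."""
    cut = median(slopes.values())
    return {s for s, v in slopes.items() if v < cut}

def status_nonresponders(scores, cutoff):
    """Students whose posttreatment score is below the cutoff are non-responsive."""
    return {s for s, v in scores.items() if v < cutoff}

# Hypothetical tutored sample of five students.
dolch_slope = {"A": 1.8, "B": 0.4, "C": 1.1, "D": 0.2, "E": 0.9}
word_reading_ss = {"A": 96, "B": 84, "C": 91, "D": 78, "E": 88}
text_fluency = {"A": 35, "B": 12, "C": 28, "D": 9, "E": 20}

by_slope = median_split_nonresponders(dolch_slope)          # slope median split
by_normalized = status_nonresponders(word_reading_ss, 90)   # standard score of 90
by_benchmark = status_nonresponders(text_fluency, 40)       # 40 words in 1 minute
```

A median split always labels roughly half of whatever sample it is given, whereas the two status criteria depend on a fixed external cutoff and can label anywhere from none to all of the sample.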

To explore the validity of these methods, we created 
responsive and non-responsive groups using each 
method. Then, for each method, we contrasted the out- 
come (May) performance and amount of growth (May 
raw score minus September raw score) of the responsive 
and non-responsive groups on the various reading 
measures in our extant database. Our assumption was 
that the more valid and preferred methods for judging 
instructional responsiveness would better differentiate 
the outcomes and growth of the responsive and non- 
responsive groups. For the May outcome performance, 
we examined students' (a) standard scores on the 
Woodcock Reading Mastery Tests (Word Identification 
and Word Attack), (b) spelling standard scores on the 
Wechsler Individual Achievement Test, and (c) fluency 
and (d) comprehension raw scores on the Comprehen- 
sive Reading Assessment Battery. The Comprehensive 
Reading Assessment Battery requires students to read 
two 400-word passages aloud. After reading each pas- 
sage, students answer 10 short-answer questions that 
address idea units of high thematic importance. 
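
One common way to express how well a method separates the two groups is a standardized mean difference. The sketch below uses Cohen's d with a pooled standard deviation; this particular formula is an assumption chosen for illustration, because the text does not state which effect size statistic was computed, and the scores are invented.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b) using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical May standard scores for the groups formed by one method.
responsive = [98, 102, 95, 105]
non_responsive = [88, 92, 85, 95]
d = cohens_d(responsive, non_responsive)
```

Under this reading, a method that sorts students into groups with larger values of d on the outcome and growth measures would be judged the better differentiator.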

Findings. The percentage of tutored children
designated non-responsive was 47.2% for the Dolch slope
median split, 47.2% for the nonsense word fluency slope
median split, 16.7% for the normalized posttreatment
status, and 100% for the benchmark posttreatment
status. By design, the median split methods identified
approximately half the tutored sample, which trans- 
lates into 3.5% of the full experimental sample. The 
two posttreatment status methods resulted in dramati- 
cally different prevalence rates of non-responders: 1.4% 
of the full experimental sample for normalized post- 
treatment status vs. 8.4% for benchmark posttreatment 
status. Normalized posttreatment status proved the 
most lenient criterion (i.e., lowest proportion of non- 
responders), whereas the benchmark posttreatment 
criterion was the most stringent criterion (i.e., highest 
proportion of non-responders). Effect sizes and
statistical significance (represented by asterisks) are shown in
Figure 1. 

In terms of how well the alternative methods differ- 
entiated responders' and non-responders' outcomes 
and growth, the two slope criteria performed differ- 
ently (see Figure 1). Dolch slope median split fared rel- 
atively well, identifying responsive and non-responsive 
groups that performed statistically significantly differ- 
ently, with large effect sizes, on every (May) outcome 
variable and on every (September to May) growth vari- 
able. The average effect size for outcomes was 1.00 stan- 
dard deviation; for growth, it was 1.19. On the 
comprehension outcome, the effect size was .90. By
contrast, nonsense word fluency slope median split
functioned poorly, distinguishing responsive and non- 
responsive groups on only one outcome (text reading 
fluency) and on none of the growth measures. The 
average effect size for the outcome variables was .43; for 
growth, it was .36. The effect size for the comprehen- 
sion outcome was .54. 

Consequently, it seems that first-graders' slope on 
sight word recognition of Dolch high-frequency words 
may be a more valid overall indicator of first-graders' 
responsiveness to an intensive standard treatment pro- 
tocol than their performance on nonsense word fluency 
tasks, which required decoding of closed-syllable 
pseudo-words. Of course, findings may be specific to the
measures we used for monitoring responsiveness. Some
work (Morgan & Young, 2002) tentatively suggests
technical problems for nonsense word fluency slope, with
the relation between it and other indicators of decoding
competence decreasing over the course of treatment.
Future studies should continue to explore the technical
properties of nonsense word fluency slope.

Figure 1. Effect sizes distinguishing responders from non-responders by classification criteria and
measures in grade 1. Outcome measures are the Woodcock Reading Mastery Test - Word
Identification (WID) and Word Attack (WAT); Wechsler Individual Achievement Test - Spelling
(SPEFF); and Comprehensive Reading Assessment Battery - Fluency (F) and Comprehension (C).
Growth measures are the same minus the Comprehensive Reading Assessment Battery.

In terms of posttreatment status, the normalized
posttreatment status criterion fared better than the
benchmark posttreatment status, as indicated in Figure 1.
Judging responsiveness in terms of whether students 
achieved a standard score of 90 or better discriminated 
responsive students from non-responsive students on 
four of five outcomes (all but comprehension), and on 
two of three growth scores (word attack and spelling, 
but not word identification). Effect sizes were large, 
with averages of 1.59 for outcome and 1.05 for growth. 
By contrast, use of the DIBELS's benchmark criterion of 
40 words read correctly from text in 1 minute (Good et 
al., 2001) resulted in no student being judged respon- 
sive. Hence, no data are presented for the benchmark 
criterion in Figure 1. Although it is possible in principle
that the tutoring treatment was ineffective, this
possibility is weakened by the results of the competing
responsiveness assessment methods, which did identify
responders. It is more likely that the DIBELS
benchmark criterion was too stringent to discriminate 
responders from non-responders, at least when assess- 
ing responsiveness to an intensive standard treatment 
protocol for an initially very low-performing sample. 

As mentioned, this database and retrospective series 
of analyses were limited to the variables selected for our 
studies. Investigators planning to prospectively explore 
the validity of alternative methods of judging treat- 
ment responsiveness at first grade would be well 
advised to include CBM's oral reading fluency in the 
second semester to monitor progress and to judge 
responsiveness. It is unfortunate that the available 
database cannot be used to examine the utility of CBM 
slope. 

In summary, it seems useful to compare the
better of the two methods for judging responsiveness
based on slope (i.e., Dolch) to the better of the two 
methods for judging responsiveness based on posttreat- 
ment status (i.e., normalized posttreatment status). In 
this comparison, Dolch slope median split fared better 
than normalized posttreatment status in terms of the 
consistency with which it differentiated the perform- 
ance of responsive students from that of non-respon- 
sive students. Using the Dolch approach, effects were 
statistically significant on every measure. Normalized 
posttreatment status, by contrast, failed to reliably dis- 
criminate end-of-year comprehension performance and 
word identification growth. Effect sizes were greater for 
normalized posttreatment status than for Dolch slope
on outcome, but not on growth, variables. These two
methods of judging responsiveness appear valid and 
might be used in a coordinated fashion in first grade. 
Future research should examine this possibility. 

Second-Grade Study Procedures and Findings 

Study procedures. In the grade 2 database, respon- 
siveness to an intensive standard treatment protocol 
was judged in six ways. The first two methods were 
modeled after Vellutino et al.'s (1996) median split: 
one on the Woodcock word-reading gain scores; the 
other on CBM slope. The next two methods were based 
on posttreatment status. Using Torgesen et al.'s (2001) 
framework, one of these methods was "normalized" 
posttreatment status, indicated by a standard score of 
90 or better on the word-reading score of the 
Woodcock Reading Mastery Tests. The second post- 
treatment status method relied on a CBM year-end 
grade 2 benchmark of at least 75 words read correctly 
from text in 1 minute. Our final two methods were also 
based on CBM performance: a normative criterion for 
expected CBM slope at grade 2 (i.e., 1.5 words' increase 
per week) and a combination of this CBM slope crite- 
rion and the benchmark CBM performance of 75 words 
correct at the end of treatment. As specified by Fuchs 
(1995), this last dual-discrepancy criterion designated 
students as non-responsive only if they failed to meet 
both criteria. In other words, if either growth rate 
or performance level was adequate, students were 
deemed responsive. We refer to these six methods for 
judging responsiveness, respectively, as Woodcock 
word reading gain median split, CBM slope median 
split, normalized posttreatment status, benchmark 
posttreatment status, normative CBM slope, and dual 
discrepancy. 
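
As an illustration, the six decision rules can be sketched in a few lines of code. The cut-points (a standard score of 90, 75 words read correctly per minute, and 1.5 words' increase per week) come from the description above; the record layout, the field names, and the handling of students who fall exactly at the median are assumptions made for the sketch, not the study's documented procedure.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Student:
    # Hypothetical record layout; field names are illustrative only.
    wrmt_gain: float       # Woodcock word-reading gain score
    wrmt_standard: float   # posttreatment Woodcock word-reading standard score
    cbm_slope: float       # CBM words-correct increase per week
    cbm_final: float       # CBM words read correctly in 1 minute at end of treatment


NORMALIZED_CUT = 90     # standard score of 90 or better counts as "normalized"
CBM_BENCHMARK = 75      # words read correctly per minute, year-end grade 2
NORMATIVE_SLOPE = 1.5   # words' increase per week at grade 2


def classify(students):
    """Return, for each method, a list of booleans (True = non-responsive)."""
    gain_median = median(s.wrmt_gain for s in students)
    slope_median = median(s.cbm_slope for s in students)
    return {
        # Assumption: students strictly below the sample median are non-responsive.
        "word_reading_gain_median_split": [s.wrmt_gain < gain_median for s in students],
        "cbm_slope_median_split": [s.cbm_slope < slope_median for s in students],
        "normalized_posttreatment_status": [s.wrmt_standard < NORMALIZED_CUT for s in students],
        "benchmark_posttreatment_status": [s.cbm_final < CBM_BENCHMARK for s in students],
        "normative_cbm_slope": [s.cbm_slope < NORMATIVE_SLOPE for s in students],
        # Dual discrepancy: non-responsive only when BOTH slope and level fall short.
        "dual_discrepancy": [
            s.cbm_slope < NORMATIVE_SLOPE and s.cbm_final < CBM_BENCHMARK
            for s in students
        ],
    }
```

Note that the two median-split rules depend on the sample at hand, whereas the remaining four rules apply fixed cut-points to each student independently.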

The following reading outcomes were available to 
examine differences between responsive and non- 
responsive groups at grade 2. For May outcomes, the 
database included Word Identification and Word 
Attack standard scores on the Woodcock Reading 
Mastery Tests, spelling standard scores on the Wechsler 
Individual Achievement Test, and fluency and compre- 
hension raw scores on the Comprehensive Reading 
Assessment Battery. For September-to-May growth (cal- 
culated as raw score gain), we used Word Identification 
and Word Attack scores for the Woodcock Reading 
Mastery Tests, spelling performance on the Wechsler 
Individual Achievement Test, and fluency and compre- 
hension scores on the Comprehensive Reading 
Assessment Battery. 

Findings. The percentage of second-graders designated 
as non-responders was 43.7% for word reading 
gain median split, 50.0% for CBM slope median split, 
45.8% for normalized posttreatment status, 91.7% for 
benchmark posttreatment status, and 29.2% for both 
normative CBM slope and dual discrepancy. 
Normative CBM slope and dual discrepancy identified 
the same pool of students due to the stringency of the 
CBM benchmark posttreatment status criterion. 
Nevertheless, across the remaining classification meth- 
ods, different proportions of students were identified as 
non-responsive. For example, the median split meth- 
ods, by design, identified approximately half the sam- 
ple (or 3.5% to 3.8% of the full experimental group), 
whereas the two posttreatment status methods resulted 
in different prevalence rates: 3.5% of the entire sample 
for normalized posttreatment status versus 7.0% for 
CBM benchmark posttreatment status. As with our 
first-grade study, therefore, the CBM benchmark post- 
treatment criterion represented a much more stringent 
criterion. The normative CBM slope and dual discrepancy 
methods identified the fewest students as non-responsive 
(2.2% of the full experimental group). This finding suggests 
that these initially very low-performing second-grade 
students grew more during tutoring than their final status 
might suggest. It also calls into question the validity 
of basing responsiveness criteria exclusively on posttreatment 
status. Because normative CBM slope and dual discrepancy 
identified the same students, results are displayed for 
five methods rather than six. 

In Figure 2, we present effect sizes and statistical 
significance (represented by asterisks) on the reading 
outcome and growth variables for the responder/non- 
responder groups as a function of classification 
method. The data for the normative CBM slope and 
dual-discrepancy methods are provided together 
because, as mentioned, the two methods identified 
identical groups of children. 

As illustrated, the CBM slope median split produced 
stronger differentiation between responsive and non- 
responsive groups than the word-reading gain median 
split. The responsive and non-responsive groups 
formed by the CBM slope median split performed 
statistically significantly differently on three of five 
outcome variables (word identification, fluency, and 
comprehension, but not on word attack or spelling) 
and on two of five growth variables (fluency and com- 
prehension, but not on word identification, word 
attack, or spelling). The average effect sizes were large: 
.94 for outcome and 1.20 for growth, with impressive 
effect sizes of 1.53 and 1.20 on comprehension out- 
come and comprehension growth, respectively. By con- 
trast, the Woodcock Word Identification gain median 
split resulted in differential performance on only the 
Word Attack outcome variable and on only the two 
Woodcock growth variables. Effect sizes were also very 
modest, with a mean of .01 for the outcome variables 
and .43 for the growth measures. Notably, effect sizes 
for the comprehension measures were in the wrong 
direction (-.34 for outcome and -.35 for growth). 
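
The text does not state which effect size formula was used. Under one common convention, assumed here purely for illustration, the responder-minus-non-responder mean difference is divided by the pooled standard deviation. With that sign convention, a negative value means the non-responders outscored the responders, which is the sense of "wrong direction" above. The scores in the sketch are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(responders, non_responders):
    """Responder mean minus non-responder mean, in pooled-SD units.

    Assumption: a pooled-SD (Cohen's d style) formula; the study's own
    formula is not specified in the text.
    """
    n1, n2 = len(responders), len(non_responders)
    pooled_var = ((n1 - 1) * variance(responders)
                  + (n2 - 1) * variance(non_responders)) / (n1 + n2 - 2)
    return (mean(responders) - mean(non_responders)) / sqrt(pooled_var)


# Hypothetical comprehension scores, not data from the study.
responders = [10.0, 12.0, 14.0]
non_responders = [11.0, 13.0, 15.0]
d = standardized_mean_difference(responders, non_responders)
print(d < 0)  # True: the non-responders scored higher on average
```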


The next two methods for designating respon- 
sive/non-responsive groups were based on posttreat- 
ment status: Torgesen et al.'s (2001) cut-point of 90 or 
higher on word reading and the second-grade CBM 
benchmark of at least 75 words read correctly from text 
in 1 minute. These two posttreatment status methods 
performed comparably well, although they differenti- 
ated responders and non-responders on different vari- 
ables. Specifically, the normalized posttreatment 
word-reading method distinguished the two groups 
on word identification, word attack, and spelling out- 
come variables and on the word attack growth score. 
Mean effect sizes were 1.22 for outcome and .52 for 
growth, with corresponding effect sizes of .52 and .40 
for comprehension. 

By contrast, the CBM benchmark discriminated the 
groups on fluency and comprehension outcome vari- 
ables as well as on the fluency growth score. Effect sizes 
were similar to those for normalized posttreatment sta- 
tus: 1.05 for outcome and .41 for growth. Although 
effect sizes for growth in comprehension were identical 
across the two posttreatment methods (.40), the com- 
prehension outcome effect sizes were notably larger for 
CBM benchmark posttreatment status (1.63) than for 
normalized posttreatment status (.52). Although neither 
of the posttreatment status methods fared as well as the 
CBM slope median split, it should be noted that only 4 
of the 48 students met the CBM benchmark criterion. 
This raises questions about the stringency of the CBM 
benchmark when used to identify non-responsiveness 
to intensive tutoring. These findings resemble those of 
the first-grade database. 

The last classification method, also a variation of 
CBM, employed a dual discrepancy for unresponsive- 
ness: growth less than 1.5 words per week and a post- 
treatment level of performance below the benchmark 
of 75 words read correctly. The CBM slope criterion 
produced the lowest percentage of unresponsive stu- 
dents: 29.2% (or 2.2% of the total experimental sam- 
ple), as opposed to 43.7% for word reading gain median 
split (3.5% of the experimental sample), 50.0% for 
CBM slope median split (3.8% of the experimental 
sample), 45.8% for normalized posttreatment status 
(3.5% of the experimental sample), and 91.7% for CBM 
benchmark posttreatment performance (7.0% of the 
experimental sample). Thus, when compared to typi- 
cally performing students' responsiveness to general 
education, many tutored students demonstrated 
respectable rates of improvement, suggesting an 
absence of disability among many of the students even 
though they failed to achieve posttreatment criteria for 
adequate performance. As the benchmark associated 
with a good prognosis increases with each grade, questions 
arise about whether these children must remain 
in intensive tutoring and, if so, for what length of time, 
and what resources might pay for the service. These 
conceptual and policy issues should be considered carefully 
before an RTI framework for reading disability 
classification is complete. 

In any case, normative CBM slope/dual discrepancy 
fared well in terms of the consistency and magnitude of 
effects in discriminating responsive from non-respon- 
sive students. This classification method produced sta- 
tistically significant effects on five variables: two 
outcomes (fluency and comprehension) and three 
growth measures (word attack, fluency, and compre- 
hension). Average effect sizes were large: .85 for out- 
come variables and .84 for growth. Effect sizes for 
comprehension outcome and growth were 1.15 and 
1.05, respectively. 

Two additional points are worth noting. First, as 
mentioned, the dual-discrepancy method resulted in 
groups identical to those identified based on slope 
alone. This was because few students achieved the post- 
treatment CBM benchmark of at least 75 words read 
correctly in 1 minute. Consequently, the dual criterion 
was unnecessary; normative slope served to differenti- 
ate the groups. Second, dual discrepancy fared no 
better than the CBM slope median split. The dual- 
discrepancy method, as conceptualized by Fuchs and 
Fuchs (1998) and studied by Speece and Case (2001), 
establishes criteria for slope and level relative to those 
of classroom peers, not with respect to the broad, nor- 
mative framework used for the present analysis. 
Therefore, we cannot comment on the reasonableness 
of cut-scores framed with reference to the local context. 
Moreover, lower benchmark cut-points employed 
within a dual-discrepancy approach would have pro- 
duced different groups of students from those based on 
a focus on only normative CBM slope. It would be 
interesting to determine a CBM benchmark that actu- 
ally forms different groups for the two approaches and 
to explore how the two groups differ. 
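
The dependence of the dual-discrepancy groups on the benchmark can be illustrated with a small sketch. The student values and the lower benchmark of 50 are hypothetical; only the slope criterion of 1.5 words per week and the 75-word benchmark come from the text.

```python
NORMATIVE_SLOPE = 1.5  # words' increase per week (from the text)


def slope_only_nonresponders(students):
    """Indices of students whose CBM slope falls below the normative criterion."""
    return {i for i, (slope, _) in enumerate(students) if slope < NORMATIVE_SLOPE}


def dual_discrepancy_nonresponders(students, benchmark):
    """Indices of students who miss BOTH the slope criterion and the benchmark."""
    return {
        i for i, (slope, level) in enumerate(students)
        if slope < NORMATIVE_SLOPE and level < benchmark
    }


# Hypothetical (slope, end-of-treatment words correct per minute) pairs.
students = [(1.0, 40), (1.2, 62), (1.4, 70), (1.8, 55), (2.1, 78)]

slope_only = slope_only_nonresponders(students)
for benchmark in (75, 50):
    dual = dual_discrepancy_nonresponders(students, benchmark)
    # Report whether the dual criterion forms the same group as slope alone.
    print(benchmark, sorted(dual), dual == slope_only)
```

In this hypothetical sample, every slow-growing student also falls below the 75-word benchmark, so the two rules coincide; at the lower benchmark, some slow-growing students meet the level criterion and the groups diverge.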

CONCLUSIONS 

These findings are preliminary because of small sam- 
ple sizes and the retrospective nature of the analyses. 
Findings require corroboration with larger samples fol- 
lowed prospectively and longitudinally across the pri- 
mary grades to investigate long-term outcomes. For 
now, we tentatively draw several conclusions across our 
two databases. 

First, alternate methods of assessing responsiveness 
produce different prevalence rates of reading disability 
and different subsets of unresponsive children. This 
is important because a major criticism of IQ-achievement 
discrepancy as a method of LD identification is 
the unreliability of the diagnosis. Practitioners relying 
on an assortment of assessment procedures in an RTI 
framework may produce similarly unreliable diagnoses. 
To develop more consistent identification procedures, 
researchers must explore the soundness of various 
methods. At the same time, however, different assess- 
ment methods demonstrate differential utility in dis- 
tinguishing responsive and non-responsive groups 
on different components of beginning reading. For 
this reason, consistency in identifying non-responders 
across the various components of beginning reading 
skill is an important criterion for selecting a valid 
assessment approach. Among the alternatives we explored, 
Dolch slope median split was the clear winner 
in terms of its consistency in grade 1: it discriminated 
responsive/non-responsive groups on all five 
outcome variables and all three growth variables. At 
second grade, no approach differentiated responders 
from non-responders on all outcome and growth vari- 
ables. However, CBM slope median split and normative 
CBM slope/dual discrepancy fared best with respect to 
consistency. Specifically, CBM slope median split differentiated 
the two groups on three of five outcome variables 
and two of five growth variables. Normative CBM 
slope/dual discrepancy differentiated the groups on 
two of five outcome and three of five growth variables. 

Second, CBM benchmark posttreatment status (as 
defined in our analyses) was a considerably more strin- 
gent criterion than the other methods. It did not pro- 
duce a single responder at grade 1, and only four 
responders at grade 2. The question is whether the cut- 
points of 40 words read correctly per minute at grade 1 
and 75 words read correctly per minute at grade 2 are 
too high to define responsiveness to intensive standard 
treatment protocols. The answer might depend on how 
students are selected to participate in intensive tutor- 
ing. In our work, children identified for tutoring had 
already demonstrated poor responsiveness during an 
entire semester of PALS, a validated classroom reading 
program. In others' work, children have been chosen 
for tutoring based on September screening scores. 
September screening will surely produce more false 
positives for risk status (Jenkins & O'Connor, 2002). 
With a higher proportion of false positives in the tutoring 
treatment, one can predict a better rate of responsiveness 
and more defensible grounds for using CBM posttreatment 
benchmarks. 

A third conclusion drawn across the first- and sec- 
ond-grade studies concerns the use of posttreatment 
status as a means of indexing responsiveness. As repre- 
sented by Torgesen et al.'s (2001) cut-point of a stan- 
dard score of 90 or better on the Woodcock Word 
Identification score, normalized posttreatment status 
differentiated responsiveness from non-responsiveness 
on posttreatment outcome measures better than on 
growth measures. This finding should come as no surprise 
given that judging responsiveness by means of 
posttreatment status fails to consider amount of learn- 
ing. At the same time, Dolch slope (at grade 1) and nor- 
mative CBM slope (at grade 2) differentiated responsive 
from non-responsive students' performance equally 
well on outcome and growth variables, suggesting the 
potential utility of slope as an index of responsiveness. 
Of course, in these analyses, outcome and growth were 
defined within a short timeframe. The real key is for- 
mulating optimal cut-points to identify the children 
who fare worst over the course of their educational 
experience, and for whom reading, especially reading 
for meaning, represents a life-long skill deficit that 
results in poor post-school outcomes. 

Our final conclusions concern reading comprehen- 
sion. In the first-grade database, Dolch median split 
produced the largest difference between responsive/ 
non-responsive groups on comprehension, where only 
outcome (not growth) information was available. At 
second grade, CBM slope median split and benchmark 
posttreatment status yielded the largest between-group 
differences on comprehension outcome; CBM slope 
median split and normative CBM slope/dual discrep- 
ancy produced the largest between-group differences 
on comprehension growth. At grade 2, monitoring stu- 
dent responsiveness with CBM was clearly superior to 
Woodcock Word Identification in terms of its corre- 
spondence to reading comprehension, at least as oper- 
ationalized in these studies. 

Rather than regarding these conclusions as written in 
stone, we offer them as reasonable hypotheses with 
which to begin prospective, systematic, and longitudi- 
nal research on the utility of alternative assessments in 
an RTI framework. At least three major components of 
such assessments need to be examined. First, research 
should explore how classification varies as a function 
of the nature of the treatment. It is likely that the cri- 
teria by which reading disability is predicted will 
require different cut-points when responsiveness is 
assessed in general education versus in intensive tutor- 
ing. In addition, keeping the nature of treatment con- 
stant, researchers must give serious thought to how 
children enter responsiveness assessment. The utility of 
alternative approaches to assessment is likely to vary as 
a function of entry criteria. 

The second component of future research concerns 
the nature of the measures used and the frequency 
of assessment. A third component addresses the criteria 
applied to define unresponsiveness. As demonstrated 
in the analyses of our first-grade and second-grade 
databases, different measurement systems using differ- 
ent criteria result in identification of different groups 
of students. The critical question is which combination 
of assessment components is most accurate for identifying 
children who will experience serious and 
chronic reading problems that prevent reading for 
meaning in the upper grades and impair their capacity 
to function successfully as adults. At this point, rela- 
tively little is known to answer this question when RTI 
is the assessment framework. 